Genome Research
Cold Spring Harbor Laboratory
Preprints posted in the last 30 days, ranked by how well they match Genome Research's content profile, based on 409 papers previously published here. The average preprint has a 0.15% match score for this journal, so anything above that is already an above-average fit.
Epain, V.; Mane, A.; Della Vedova, G.; Bonizzoni, P.; Chauve, C.
We address the problem of plasmid binning, which aims to group contigs from a draft short-read assembly of a bacterial sample into bins, each expected to correspond to a plasmid present in the sequenced bacterial genome. We formulate the plasmid binning problem as a network multi-flow problem in the assembly graph and describe a Mixed-Integer Linear Program to solve it. We compare our new method, PlasBin-HMF, with the state-of-the-art methods MOB-recon, gplasCC, and PlasBin-flow on a dataset of more than 500 bacterial samples, and show that PlasBin-HMF outperforms the other methods while preserving explainability.
Durbin, R.
Skiplists (Pugh, 1990) are probabilistic data structures over ordered lists supporting O(log N) insertion and search, which share many properties with balanced binary trees. Previously we introduced the graph Burrows-Wheeler transform (GBWT) to support efficient search over pangenome path sets, but current implementations are static and cumbersome to build and use. Here we introduce two doubly-linked skiplist variants over run-length-compressed BWTs that support O(log N) rank, access and insert operations. We use these to store and search over paths through a syncmer graph built from Edgar's closed syncmers, equivalent to a sparse de Bruijn graph. Code is available in rskip.[ch] within the syng package at github.com/richarddurbin/syng. This builds a 5.8 GB lossless GBWT representation of 92 full human genomes (280 Gbp, including all centromeres and other repeats) single-threaded in 52 minutes, on top of a 4 GB 63 bp syncmer set built in 37 minutes. Arbitrarily long maximal exact matches (MEMs) can then be found as seeds for sequence matches to the graph at a search rate of approximately 1 Gbp per 10 seconds per thread.
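For readers unfamiliar with the underlying data structure, a minimal Pugh-style skiplist can be sketched in a few lines of Python. This is a toy illustration of the expected O(log N) search and insert only; the preprint's variants operate over run-length-compressed BWTs and are implemented in C.

```python
import random

class SkipNode:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height  # one forward pointer per level

class SkipList:
    """Toy probabilistic skiplist: expected O(log N) search and insert."""
    MAX_HEIGHT = 16
    P = 0.5  # probability of promoting a node to the next level

    def __init__(self):
        self.head = SkipNode(None, self.MAX_HEIGHT)

    def _random_height(self):
        h = 1
        while h < self.MAX_HEIGHT and random.random() < self.P:
            h += 1
        return h

    def insert(self, key):
        # Record the rightmost node before `key` at every level.
        update = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[level] and node.next[level].key < key:
                node = node.next[level]
            update[level] = node
        new = SkipNode(key, self._random_height())
        for level in range(len(new.next)):
            new.next[level] = update[level].next[level]
            update[level].next[level] = new

    def search(self, key):
        node = self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[level] and node.next[level].key < key:
                node = node.next[level]
        node = node.next[0]
        return node is not None and node.key == key
```

Search correctness does not depend on the random level assignments, only the expected running time does.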
Plachy, J.; Sladky, O.; Brinda, K.; Vesely, P.
The growing interest in k-mer-based methods across bioinformatics calls for compact k-mer set representations that can be optimized for specific downstream applications. Recently, masked superstrings have provided such flexibility by moving beyond de Bruijn graph paths to general k-mer superstrings equipped with a binary mask, thereby subsuming Spectrum-Preserving String Sets and achieving compactness on arbitrary k-mer sets. However, existing methods optimize superstring length and mask properties in two separate steps, possibly missing solutions where a small increase in superstring length yields a substantial reduction in mask complexity. Here, we introduce the first method for Pareto optimization of k-mer superstrings and masks, and apply it to the problem of compressing pan-genome k-mer sets. We model the compressibility of masked superstrings using an objective that combines superstring length and the number of runs in the mask. We prove that the resulting optimization problem is NP-hard and develop a heuristic based on iterative deepening search in the Aho-Corasick automaton. Using microbial pan-genome datasets, we characterize the Pareto front in the superstring-length/mask-run space and show that the front contains points that Pareto-dominate simplitigs and matchtigs, while nearly encompassing the previously studied greedy masked superstrings. Finally, we demonstrate that Pareto-optimized masked superstrings improve pan-genome k-mer set compressibility by 12-19% when combined with neural-network compressors.
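The masked-superstring representation itself is simple to illustrate. A toy decoder follows, assuming a per-position binary mask in which a set bit marks the start of a k-mer belonging to the set; the optimization over superstring length and mask runs described above is the hard part and is not shown.

```python
def kmers_from_masked_superstring(superstring, mask, k):
    """Decode the represented k-mer set: positions with mask bit '1'
    mark the starts of k-mers belonging to the set."""
    assert len(superstring) == len(mask)
    return {superstring[i:i + k]
            for i, bit in enumerate(mask) if bit == "1"}

def mask_runs(mask):
    """Number of runs of equal bits in the mask, the quantity the
    Pareto objective trades against superstring length."""
    return 1 + sum(1 for a, b in zip(mask, mask[1:]) if a != b)

# With k = 3, the superstring "ACGTA" under mask "11100" encodes
# {ACG, CGT, GTA}; flipping one bit to "10100" drops CGT from the
# set without changing the superstring, at the cost of more runs.
```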
Weir, J. A.; Krebs, Y.; Chen, F.
Probe-based single cell RNA sequencing approaches are increasingly becoming a technology of choice for profiling gene expression at scale and in archival tissues. The 10x Genomics Flex v1 assay enables cost-effective and high-sensitivity single-cell RNA sequencing by splitting samples across up to 16 uniquely barcoded probe sets before pooling and loading onto a single lane of a microfluidic chip. A natural extension of this design is to leverage probe set barcoding as a sample barcoding strategy for case-control experiments. However, we observed that Flex v1 probe set barcode identity drives substantial technical variation between probe set barcodes, an effect that is reproducible across lanes and independent datasets. When Flex v1 probe set barcodes are confounded with biological sample identity, a concerning number of differentially expressed genes at standard thresholds are false positives. The Flex v2 assay, which decouples sample barcoding from probe set hybridization, significantly reduces this artifact. As the field continues to expand adoption of probe-based assays, our findings introduce probe set barcoding as an underappreciated source of technical variation in single-cell assays and emphasize the importance of experimental design when using probe-based sequencing technologies.
Ghoreishi, S. A.; Szmigiel, A. W.; Nagai, J. S.; Gesteira Costa Filho, I.; Zimek, A.; Campello, R. J. G. B.
Single-cell RNA sequencing (scRNA-seq) is widely used to resolve cellular heterogeneity across thousands to millions of cells. A major challenge is to identify biologically meaningful cell populations while preserving their hierarchical organization, because broad cell types frequently split into more specialized subtypes. However, state-of-the-art approaches mostly focus on flat partitions and ignore the hierarchical structure of single-cell data. Here we introduce GraphHDBSCAN*, a graph-based, hyperparameter-free extension of HDBSCAN* that performs hierarchical density-based clustering on a graph representation of the data, enabling robust recovery of both single-level and hierarchical relationships in high-dimensional and sparse datasets. We evaluate GraphHDBSCAN* across multiple scRNA-seq datasets and show that it recovers biologically meaningful hierarchies that reveal fine-grained structure in complex data, including monocyte subpopulations. In addition, the method yields high-quality flat partitions that outperform widely used community-detection methods.
Yang, Y.
The rapid growth of T-cell receptor (TCR) sequencing data has created an urgent need for computational methods that can efficiently search CDR3 sequences at scale. Existing approaches either rely on exact pairwise distance computation, which scales quadratically with repertoire size, or employ heuristic grouping that sacrifices sensitivity. Here we present TCRseek, a two-stage retrieval framework that combines biologically informed sequence embeddings with approximate nearest neighbor (ANN) indexing for scalable search over TCR repertoires. TCRseek first encodes CDR3 amino acid sequences into fixed-length numerical vectors through a multi-scale windowed k-mer embedding scheme derived from BLOSUM62 eigendecomposition, then indexes these vectors using FAISS-based structures (IVF-Flat, IVF-PQ, or HNSW-Flat) that support sublinear-time search. A second-stage reranking module refines the shortlisted candidates using exact sequence alignment scores (Needleman-Wunsch with BLOSUM62), Levenshtein distance, or Hamming distance. We benchmarked TCRseek against tcrdist3, TCRMatch, and GIANA on a 100,000-sequence corpus with precomputed exact ground truth under three distance metrics. Under cross-metric evaluation--where the reranking and ground truth metrics differ, providing the most informative test of generalization--TCRseek achieved NDCG@10 = 0.890 (Levenshtein ground truth) and 0.880 (Hamming ground truth), ranking highest among the retained baselines under Hamming and remaining competitive with tcrdist3 (0.894) under Levenshtein. When the reranking metric matches the ground truth definition (BLOSUM62 alignment), NDCG@10 reached 0.993, confirming that the ANN shortlist captures >99% of true neighbors--the expected ceiling of the two-stage design. On the 100,000-sequence corpus, TCRseek achieved 3.6-39.6x speedup over exact brute-force search depending on index type and distance metric, with the largest gains for alignment-based retrieval. 
These results demonstrate that embedding-based ANN search provides a practical and scalable alternative for TCR repertoire analysis.
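The two-stage retrieval design can be sketched with a toy stand-in: an amino-acid-composition embedding rather than TCRseek's BLOSUM62-derived windowed k-mer scheme, and an exhaustive embedding scan rather than a FAISS index. The names and parameters here are illustrative assumptions, not the published API.

```python
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"

def embed(seq):
    """Toy fixed-length embedding: amino-acid composition fractions."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO]

def levenshtein(a, b):
    """Standard dynamic-programming edit distance, used for reranking."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def search(query, corpus, shortlist_size=3, top=1):
    q = embed(query)
    # Stage 1: coarse shortlist by squared Euclidean distance in embedding space.
    coarse = sorted(corpus,
                    key=lambda s: sum((x - y) ** 2 for x, y in zip(embed(s), q)))
    shortlist = coarse[:shortlist_size]
    # Stage 2: exact rerank of the shortlist by edit distance.
    return sorted(shortlist, key=lambda s: levenshtein(s, query))[:top]
```

The two-stage structure is what makes sublinear ANN indexing pay off: the cheap embedding pass prunes the corpus, and the expensive exact metric only touches the shortlist.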
Hartman, A.; Blair, J. D.; Nguyen, T. P.; Dyson, K.; Bradu, A.; Takacsi-Nagy, O.; Santostefano, K.; Boade, T.; Bolanos, M.; Zhu, R.; Dann, E.; Marson, A.; Gitler, A.; Satija, R.; Satpathy, A. T.; Roth, T. L.
Genome-wide Perturb-seq (GWPS) has emerged as a powerful approach for unbiased mapping of gene regulatory networks. A key assumption underlying many Perturb-seq analyses is that each guide RNA exclusively perturbs a single target locus. Without methods to identify and filter off-target events, erroneous gene-pathway associations driven by off-target activity can propagate into downstream analyses. Here, we present a workflow for the systematic identification of candidate off-target events in CRISPRi Perturb-seq experiments. Our approach exploits the observation that cells harboring a guide that represses an off-target gene display transcriptional similarity to cells in which that gene is directly targeted by an on-target guide. We apply our workflow to multiple GWPS datasets and nominate off-target events in which a guide nominally targeting one gene also represses a distinct gene, producing a phenotype likely attributable to the off-target perturbation. We use both off-target gene repression and guide seed sequence alignments at the off-target promoter locus as evidence for off-target effects and find independent evidence of putative off-target events in separate GWPS datasets. Together, these results establish a principled framework for the identification and filtering of off-target guide effects in Perturb-seq experiments.
Conchon-Kerjan, E.; Rouze, T.; Robidou, L.; Ingels, F.; Limasset, A.
Approximate membership query structures are used throughout sequence bioinformatics, from read screening and metagenomic classification to assembly, indexing, and error correction. Among them, Bloom filters remain the default choice. They are not the most efficient structures in either time or memory, but they provide an effective compromise between compactness, speed, simplicity, and dynamic insertions, which explains their widespread adoption in practice. Their main drawback is poor cache locality, since each query typically requires several random memory accesses. Blocked Bloom filters alleviate this issue by restricting accesses for any given element to a single memory block, but this usually comes with a loss in accuracy at fixed memory. In this work, we introduce the Super Bloom Filter, a Bloom filter variant designed for streaming k-mer queries on biological sequences. Super Bloom uses minimizers to group adjacent k-mers into super-k-mers and assigns all k-mers of a group to the same memory block, thereby amortizing random accesses over consecutive k-mer queries and improving cache efficiency. We further combine this layout with the findere scheme, which reduces false positives by requiring consistent evidence across overlapping subwords. We provide a theoretical analysis of the construction of Super Bloom filters, showing how minimizer density controls the expected reduction in memory transfers, and derive a practical parameterization strategy linking memory budget, block size, collision overhead, and the number of hash functions to robust false-positive control. Across a broad range of memory budgets and numbers of hash functions, Super Bloom consistently outperforms existing Bloom filter implementations, with several-fold time improvements. 
As a practical validation, we integrated it into a Rust reimplementation of BioBloom Tools, a sequence screening tool that builds filters from reference genomes and classifies reads through k-mer membership queries for applications such as host removal and contamination filtering. This replacement yields substantially faster indexing and querying than both the original C++ implementation and Rust variants based on Bloom filters and blocked Bloom filters. The findere scheme also reduces false positives by several orders of magnitude, with some configurations yielding no observed false positives among 10^9 randomly queried k-mers. Code is available at https://github.com/EtienneC-K/SuperBloom and https://github.com/Malfoy/SBB.
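The core layout idea, choosing a Bloom filter block by hashing the k-mer's minimizer so that consecutive k-mers sharing a minimizer touch the same cache block, can be sketched as follows. This is a toy Python illustration using lexicographic minimizers and integer bitmasks as blocks; the actual Super Bloom implementation is in Rust and additionally layers on the findere scheme.

```python
import hashlib

def h(data, seed):
    """Seeded 64-bit hash of a string via blake2b's salt parameter."""
    digest = hashlib.blake2b(data.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")

def minimizer(kmer, m):
    """Lexicographically smallest m-mer of the k-mer (toy minimizer)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

class MinimizerBlockedBloom:
    """Toy blocked Bloom filter: the block is chosen by the k-mer's
    minimizer, so consecutive k-mers that share a minimizer hit the
    same block, amortizing random memory accesses."""
    def __init__(self, n_blocks=1024, block_bits=512, n_hashes=3, m=5):
        self.blocks = [0] * n_blocks           # each block is one bitmask
        self.n_blocks = n_blocks
        self.block_bits, self.n_hashes, self.m = block_bits, n_hashes, m

    def _positions(self, kmer):
        block = h(minimizer(kmer, self.m), 0) % self.n_blocks
        bits = [h(kmer, s + 1) % self.block_bits for s in range(self.n_hashes)]
        return block, bits

    def insert(self, kmer):
        block, bits = self._positions(kmer)
        for b in bits:
            self.blocks[block] |= 1 << b

    def query(self, kmer):
        block, bits = self._positions(kmer)
        return all(self.blocks[block] >> b & 1 for b in bits)
```

As with any Bloom filter, queries can return false positives but never false negatives; restricting each element to one block trades a little accuracy for locality.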
Poggiali, B.; Putzeys, L.; Andersen, J. D.; Vidaki, A.
Summary: The human genome is dominated by repetitive DNA, whose genetic and epigenetic variation plays a key role in gene regulation, genome stability, and disease. Recent advances in long-read sequencing now enable large-scale, haplotype-resolved, and DNA methylation-informative analysis of the human genome, including previously inaccessible complex and repetitive regions. However, the comprehensive, simultaneous characterisation of the "human repeatome" remains challenging, largely due to the lack of tools integrated in a single pipeline that can capture the full spectrum of variation across diverse types of DNA repeats. Here, we present ECHO, a user-friendly, Snakemake-based pipeline for the "(Epi)genomic Characterisation of Human Repetitive Elements using Oxford Nanopore Sequencing". ECHO provides a reproducible and scalable framework for end-to-end analysis of whole-genome nanopore sequencing data, enabling integrative but also tailored (epi)genetic analyses of the human repeatome. Availability and implementation: ECHO is freely available on GitHub: https://github.com/leenput/ECHO-pipeline, with an archived version at Zenodo: https://zenodo.org/records/19068468. Contact: athina.vidaki@mumc.nl; athina.vidaki@maastrichtuniversity.nl
BAI, J.; Yang, R.
By mapping ribosome-protected fragments (RPFs) genome-wide, ribosome profiling (Ribo-seq) has uncovered extensive translation beyond conventional coding sequences, revealing non-canonical ORFs (ncORFs) with emerging roles in diverse biological processes. However, protocol-induced biases introduced during library construction can substantially distort RPF signals. Most existing ORF callers are not designed to explicitly account for such artifacts, limiting robust ncORF identification. Here, we present RiboBA, a bias-aware probabilistic framework to address this challenge. RiboBA consists of two main components: a generative module that recovers protocol-induced biases and codon-level ribosome occupancy, and a supervised module that identifies translated ORFs and initiation sites using the resulting bias-adjusted profiles. Evaluated through simulations and on a range of Ribo-seq datasets--particularly supported by cell-type-specific immunopeptidomics--RiboBA robustly recovers protocol-induced parameters and achieves superior accuracy and sensitivity in ncORF identification. Notably, RiboBA performs particularly well on RNase I libraries with attenuated three-nucleotide periodicity, as well as on MNase and nuclease P1 libraries, while maintaining competitive runtimes. In a Drosophila case study, RiboBA identifies conserved ncORFs with coding potential, including recurrent upstream translation of ThrRS and Mettl2 that suggests a potential threonine-specific translational control axis.
Dip, S. A.; Zhang, L.
Predicting transcriptional responses to genetic perturbations is a central challenge in functional genomics. CRISPR Perturb-seq experiments measure gene expression changes induced by targeted perturbations, yet experimentally testing all possible perturbations remains infeasible. Computational models that infer responses for unseen perturbations are therefore essential for scalable functional discovery. We introduce PerturbGraph, a biologically informed graph-learning framework for predicting transcriptional responses of unseen gene perturbations by integrating interaction networks, functional annotations, and transcriptional features. Our approach is motivated by the observation that perturbation effects propagate through molecular interaction networks and manifest as coordinated transcriptional programs. Starting from single-cell CRISPR perturbation data, we construct perturbation signatures representing expression shifts relative to control cells and project them into a compact latent program space that captures stable transcriptional variation while reducing noise. Each gene is represented using enriched biological features integrating protein-protein interaction network embeddings, network topology statistics, baseline transcriptional characteristics, and Gene Ontology annotations. A graph neural network propagates information across the interaction network to infer perturbation programs for genes whose effects are not observed during training. Across unseen-perturbation benchmarks, PerturbGraph consistently outperforms classical machine learning models, perturbation-specific deep learning approaches such as scGen and CPA, and alternative graph neural architectures. The model achieves up to 6% improvement in cosine similarity over strong tree-based baselines and more than 20% improvement over linear models while improving recovery of differentially expressed genes. 
These results show that integrating biological interaction networks with graph representation learning enables accurate prediction of transcriptional effects for previously unobserved genetic perturbations. Code is publicly available at https://github.com/Sajib-006/PerturbGraph.
Hendrychova, V.; Brinda, K.
One important question in bacterial genomics is how to represent and search modern million-genome collections at scale. Phylogenetic compression effectively addresses this by guiding compression and search via evolutionary history, and many related methods similarly rely on tree- and ordering-based heuristics that leverage the same underlying phylogenetic signal. Yet, the mathematical principles underlying phylogenetic compression remain poorly understood. Here, we introduce the first formal framework to model phylogenetic compression mechanisms. We study genome collections represented as RLE-compressed SNP, k-mer, unitig, and uniq-row matrices and formulate compression as an optimization problem over genome orderings. We prove that while the problem is NP-hard for arbitrary data, for genomes following the Infinite Sites Model it becomes optimally solvable in polynomial time via Neighbor Joining (NJ). Finally, we experimentally validate the model's predictions on real bacterial datasets using an exact Traveling Salesperson Problem (TSP) solver. We demonstrate that, despite numerous simplifying assumptions, NJ orderings achieve near-optimal compression across dataset types, representations, and k-mer ranges. Altogether, these results explain the mathematical principles underlying the efficacy of phylogenetic compression and, more generally, the success of tree-based compression and indexing heuristics across bacterial genomics.
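The link between genome ordering and RLE size is easy to demonstrate on a toy SNP presence/absence matrix. The sketch below illustrates the optimization objective only (total runs across columns as a proxy for RLE-compressed size), not the NP-hardness proof or the NJ algorithm; the genomes and orderings are invented for illustration.

```python
def column_runs(matrix):
    """Total number of runs across all columns of a binary matrix,
    a proxy for its RLE-compressed size."""
    total = 0
    for col in range(len(matrix[0])):
        runs = 1
        for row in range(1, len(matrix)):
            if matrix[row][col] != matrix[row - 1][col]:
                runs += 1
        total += runs
    return total

# Four toy genomes as binary SNP rows, forming two clades (A and B).
genomes = {
    "A1": "111000", "A2": "111001",
    "B1": "000110", "B2": "000111",
}

def runs_for(order):
    return column_runs([genomes[g] for g in order])

tree_order = ["A1", "A2", "B1", "B2"]  # clade-grouped (NJ-like) ordering
shuffled   = ["A1", "B1", "A2", "B2"]  # clades interleaved
```

Grouping relatives adjacently makes each SNP column change value fewer times down the matrix, so the clade-grouped ordering compresses better, which is exactly the signal phylogenetic compression exploits.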
Torma, G.; Balazs, Z.; Fulop, A.; Tombacz, D.; Boldogkoi, Z.
Long-read RNA sequencing (lrRNA-seq) enables direct reconstruction of full-length transcripts, yet existing annotation tools show variable performance across genomes and library chemistries, particularly for novel isoforms. We present LoRTIA Plus, a chemistry-agnostic suite for transcriptome annotation and reconstruction from lrRNA-seq data. LoRTIA Plus first detects and filters transcription start sites (TSSs), transcription end sites (TESs), and introns using adapter-aware and quality-based criteria, and evaluates read support before assembling high-confidence transcript models. We benchmarked LoRTIA Plus against bambu, FLAIR, IsoQuant, and NAGATA on KSHV transcriptomes with dense overlap, using a validated literature-supported boundary set, and on transcriptomes from three human cell lines from the Long Read RNA-seq Genome Annotation Assessment Project (LRGASP) sequenced with five long-read chemistries. On KSHV, LoRTIA Plus achieved the highest F1 scores for TSSs, TESs, and transcripts in both direct-cDNA and direct-RNA datasets by improving recall without sacrificing precision. Across human datasets, LoRTIA Plus consistently ranked among the top boundary annotators across all chemistries and was the best-performing tool in PCR-based libraries, while remaining highly competitive on native RNA. Junction- and isoform-level analyses show that LoRTIA Plus yields a rich, reproducible repertoire of novel isoforms and transcript boundaries from viral to human transcriptomes.
Hearne, G.; Refahi, M. S.; Polikar, R.; Rosen, G. L.
Transformer-based Genomic Language Models (GLMs) have achieved strong performance across diverse genomic prediction tasks. However, their tendency toward overconfident predictions--particularly on noisy or unfamiliar data--limits reliability. In genomics, where unknown species and novel variants are common, developing models robust to distribution shift is crucial for dependable predictions. Here, we analyze the impact of several common and novel uncertainty quantification (UQ) methods in the context of GLMs, evaluating their performance across diverse downstream genomic and metagenomic prediction tasks. Comparing model behavior on both in-distribution (ID) and out-of-distribution (OOD) data, we show that temperature scaling and epistemic neural networks are capable of improving classification reliability across multiple GLM architectures and domains. The software is available at: https://github.com/EESI/glm-epinet-pyt
Zhang, S.; Lu, Y.; Luo, Q.; An, L.
Identifying cell type-specific expressed genes (marker genes) is essential for understanding the roles and interactions of cell populations within tissues. To achieve this, the traditional differential analysis approaches are often applied to individual cell-type bulk RNA-seq and single-cell RNA-seq data. However, real-world datasets often pose challenges, such as heterogeneous bulk RNA-seq and incomplete scRNA-seq. Heterogeneous bulk RNA-seq amalgamates gene expression profiles from multiple cell types and results in low resolution, while incomplete scRNA-seq does not capture some cell types from the tissue, leading to unknown cell types. Traditional methods fail to identify marker genes for such unknown cell types. MiCBuS addresses this limitation by generating Dirichlet-pseudo-bulk RNA-seq based on bulk and incomplete single-cell RNA-seq data. By performing differential analysis of gene expressions on bulk and Dirichlet-pseudo-bulk RNA-seq samples, MiCBuS can identify the marker genes of unknown cell types, enabling the identification and characterization of these elusive cellular components. Simulation studies and real data analyses demonstrate that MiCBuS reliably and robustly identifies marker genes specific to unknown cell types, a capability that traditional differential analysis methods cannot achieve. Availability and implementation: MiCBuS is implemented in the R language and freely available at https://github.com/Shanshan-Zhang/MiCBuS.
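The Dirichlet-pseudo-bulk generation step can be sketched as a random convex combination of known cell-type mean expression profiles. This is a minimal stand-in assuming Dirichlet-sampled mixture weights over per-type means; it is not the published MiCBuS implementation (which is in R), and the function names are illustrative.

```python
import random

def dirichlet(alpha, rng):
    """Sample mixture weights from a Dirichlet via normalized Gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def pseudo_bulk(cell_type_means, alpha, rng):
    """One pseudo-bulk sample: a Dirichlet-weighted convex combination
    of cell-type mean expression profiles."""
    weights = dirichlet(alpha, rng)
    n_genes = len(next(iter(cell_type_means.values())))
    sample = [0.0] * n_genes
    for w, profile in zip(weights, cell_type_means.values()):
        for g in range(n_genes):
            sample[g] += w * profile[g]
    return sample
```

Because the weights are a convex combination, any gene expressed identically across the known cell types keeps that value in every pseudo-bulk sample, so deviations in the real bulk data point toward unobserved cell types.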
Etzioni, Z.; Zhao, L.; Hertleif, P.; Schuster-Boeckler, B.
Cytosine methylation is a crucial epigenetic mark that impacts tissue-specific chromatin conformation and gene expression. For many years, bisulfite sequencing (BS-seq), which converts all non-methylated cytosines (C) to thymine (T), remained the only approach to measure cytosine methylation at base resolution. Recently, however, several new methods that convert only methylated cytosines to thymine (mC[->]T) have become widely available. Here we present rastair, an integrated software toolkit for simultaneous SNP detection and methylation calling from mC[->]T sequencing data such as those created with Watchmaker's TAPS+ and Illumina's 5-Base chemistries. Rastair combines machine-learning-based variant detection with genotype-aware methylation estimation. Using NA12878 benchmark datasets, we show that rastair outperforms existing methylation-aware SNP callers and achieves F1 scores exceeding 0.99 for datasets above 30x depth, matching the accuracy of state-of-the-art tools run on whole-genome sequencing data. At the same time, rastair is significantly faster than other genetic variant callers: processing a 30x depth file takes less than 30 minutes on 32 CPU cores of an Intel Xeon, and half as long when a GPU is available. By integrating genotyping with methylation calling, rastair reports an additional 500,000 positions in NA12878 where a SNP turns a non-CpG reference position into a "de-novo" CpG. Vice versa, rastair also identifies positions where a variant disrupts a CpG and corrects their reported methylation levels. Rastair produces standard-compliant outputs in VCF, BAM and BED formats, facilitating integration into downstream analysis pipelines. Rastair is open-source and available via conda, Dockerhub, and as pre-compiled binaries from https://www.rastair.com.
Khan, M. S. A.; Kabir, M. H.; Faisal, M. M.
Single-cell RNA sequencing (scRNA-seq) enables characterization of cellular heterogeneity, but clustering remains challenging due to high dimensionality, dropout-induced sparsity, and technical noise. Existing graph-based and contrastive methods often rely on predefined similarity measures or suffer from high computational costs on large datasets. We propose single-cell Transformer-based Graph Contrastive Learning (scTGCL), a framework integrating multi-head self-attention with graph contrastive learning to learn robust cell representations. The model projects raw expression data into an embedding space and employs multi-head attention to adaptively learn weighted cell-cell graphs capturing diverse biological relationships. For contrastive augmentation, we apply random gene masking at the feature level and random edge dropping on attention matrices, simulating dropout and structural uncertainty. A symmetric contrastive loss maximizes agreement between original and augmented representations, while joint optimization with reconstruction and imputation losses preserves biological interpretability. Experiments on ten real scRNA-seq datasets demonstrate that scTGCL consistently outperforms nine state-of-the-art methods across clustering accuracy, normalized mutual information, and adjusted Rand index. Ablation studies validate each architectural component, and robustness analysis on simulated data confirms stable performance under varying dropout rates and differential expression levels. Furthermore, scTGCL exhibits superior computational efficiency, achieving substantially lower runtime on large-scale datasets compared with existing approaches. The framework provides an accurate, efficient, and scalable solution for single-cell clustering. Source code and datasets are available at https://github.com/ShoaibAbdullahKhan/scTGCL.
Forcier, T.; Cheng, E.; Tam, O. H.; Wunderlich, C.; Castilla-Vallmanya, L.; Jones, J. L.; Quaegebeur, A.; Barker, R. A.; Jakobsson, J.; Gale Hammell, M.
Transposable elements (TEs) are mobile genetic sequences that can generate new copies of themselves via insertional mutations. These viral-like sequences comprise nearly half the human genome and are present in most genome-wide sequencing assays. While only a small fraction of genomic TEs have retained their ability to transpose, TE sequences are often transcribed from their own promoters or as part of larger gene transcripts. Accurately assessing TE expression from each individual genomic TE locus remains an open problem in the field, due to the highly repetitive nature of these multi-copy sequences. These issues are compounded in single-cell and single-nucleus transcriptome experiments, where additional complications arise due to sparse read coverage and unprocessed mRNA introns. Here we present our tool for single-cell TE and gene expression analysis, TEsingle. Using synthetic datasets, we show the problems that arise when not properly accounting for intron retention events, failing to address uncertainty in alignment scoring, and failing to make use of unique molecular identifiers for transcript resolution. Addressing these challenges has enabled an accurate TE analysis suite that simultaneously tracks gene expression as well as locus-specific resolution of expressed TEs. We showcase the performance of TEsingle using single-nucleus profiles from substantia nigra (SN) tissues of Parkinson's Disease (PD) patients. We find examples of young and intact TEs that mark dopaminergic (DA) neurons as well as many young TEs from the LINE and ERV families that are elevated in PD neurons and glia. These results demonstrate that TE expression is highly cell-type and cellular-state specific and elevated in particular subsets of neurons, astrocytes, and microglia from PD patients.
Xiang, Y.; Xiao, X.; Zhou, B.; Xie, L.
Motivation: Enhancer-derived RNAs (eRNAs) and their fusion with protein-coding genes represent a crucial yet understudied layer of transcriptional regulation. eRNAs are typically expressed at low levels, which makes fusion events difficult to detect with conventional fusion detection tools. In addition, these tools are not designed to capture fusion transcripts arising from spatial proximity between distal regulatory elements and gene loci. Reads spanning such regions are also frequently filtered as mapping artifacts. As a result, computational approaches for systematically identifying spatially mediated enhancer-exon fusion transcripts remain lacking. Methods: We developed ChiMER, a graph-based framework for detecting ChiMeric Enhancer RNAs from short-read RNA-seq data. ChiMER constructs splice graphs with chromatin contact information to introduce enhancer-exon edges and uses graph alignment to search for potential transcriptional paths. A ranking-based scoring module then prioritizes high-confidence events. Evaluations on simulated and real RNA-seq datasets show that ChiMER achieves higher sensitivity than conventional linear fusion detection methods while maintaining low false-positive rates. Results: Applied to cancer cell line RNA-seq datasets, ChiMER identified multiple enhancer-exon chimeric transcripts, several associated with super-enhancer regions. Multi-omics analysis further shows that fusion transcripts occur in transcriptionally active regulatory environments and frequently coincide with strong R-loop signals, suggesting a potential role of RNA-DNA hybrid structures in facilitating long-range transcriptional joining events. Availability: https://github.com/Candlelight-XYJ/ChiMER Contact: yujia.xiang@outlook.com, xielinhai@ncpsb.org.cn
Tanner, R. M.; Perkins, T. J.
Histone modifications are a key component of the epigenetic state of a cell, and they vary widely across different cell and tissue types, conditions, and disease states. Indeed, the majority of the genome is enriched with one histone mark or another across the thousands of cellular conditions that have been studied to date. Here, we use the largest-to-date collection of histone modification ChIP-seq datasets to identify the most important sites of histone modifications genome-wide. Collected and uniformly reprocessed by the International Human Epigenome Consortium, these data include 5339 datasets enriched at nearly one billion total peaks across 59 different major cell or tissue types and in healthy and disease conditions, for six different histone marks. We propose FindMetapeaks, a new approach to identifying histone mark metapeaks, which are genomic regions with enrichment of a mark across many samples. We show that many of these epigenetic metapeaks are strongly indicative of cell and tissue type, or are associated with other sample characteristics, and highlight key regulatory regions of the genome. However, we also show that many metapeaks contain redundant information, and that parsimonious subsets of metapeaks can be selected by machine learning to predict cell state. Our histone mark metapeak atlas provides a concise set of regions for interpreting the epigenome. Availability: https://github.com/rmbioinfo83/FindMetapeaks/
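The metapeak concept, genomic regions enriched across many samples, can be sketched as bin-level counting. This is a simplified stand-in for FindMetapeaks; the bin size, sample threshold, and function name are illustrative assumptions, not the published method.

```python
from collections import Counter

def metapeak_bins(sample_peaks, bin_size, min_samples):
    """Return genome bins covered by peaks in at least `min_samples`
    samples; a toy stand-in for metapeak calling."""
    coverage = Counter()
    for peaks in sample_peaks:               # one (start, end) list per sample
        bins = set()
        for start, end in peaks:
            bins.update(range(start // bin_size, (end - 1) // bin_size + 1))
        for b in bins:                       # each sample counts once per bin
            coverage[b] += 1
    return sorted(b for b, count in coverage.items() if count >= min_samples)
```

Deduplicating bins per sample before counting ensures a sample with many overlapping peaks in one region still contributes a single vote, which is what makes the count a cross-sample recurrence measure.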